title: Red Wine Analysis by R
author: Kholood Alsaggaf
Abstract: An analysis of the Red Wine dataset was conducted to understand which variables are responsible for the quality of the wine, by finding the correlations between wine quality and the other factors. Finally, a linear model is used to predict the outcome for a test set.
========================================================
About the data: This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine.
## 'data.frame': 1599 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
## $ rating : Ord.factor w/ 3 levels "bad"<"average"<..: 2 2 2 2 2 2 2 3 3 2 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality rating
## Min. : 8.40 3: 10 bad : 63
## 1st Qu.: 9.50 4: 53 average:1319
## Median :10.20 5:681 good : 217
## Mean :10.42 6:638
## 3rd Qu.:11.10 7:199
## Max. :14.90 8: 18
First, plot the distribution of each variable to get an idea of the data and observe the shape of each distribution. Then remove the extreme outliers to get a clearer analysis.
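The trimming step can be sketched as follows. This is a minimal illustration on simulated right-skewed values, not the wine data, and the 99th-percentile cutoff is an assumption, since the report does not state which threshold was used. Base R is used here only to keep the sketch free of package dependencies.

```r
# Illustrative only: simulated right-skewed values standing in for a wine variable.
set.seed(42)
x <- rlnorm(500, meanlog = 2, sdlog = 0.4)

# Keep values at or below the 99th percentile (the cutoff is an assumption).
upper <- quantile(x, 0.99)
trimmed <- x[x <= upper]

# Compute the histogram bins without drawing, then compare mean and median:
# a mean above the median is a quick sign of positive skew.
h <- hist(trimmed, breaks = 30, plot = FALSE)
c(mean = mean(trimmed), median = median(trimmed))
```

Calling `plot(h)` draws the trimmed histogram; the same cutoff could be applied as an axis limit in a plotting library instead of filtering the data.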
Observations: The Fixed Acidity distribution is positively skewed, with a median of around 8 and a high concentration of wines near that value. The plots have been modified to exclude extreme outliers.
Observations: The Volatile Acidity distribution is bimodal, with two peaks at 0.4 and 0.6.
Observations: Citric Acid has no clear visual distribution; there may be something wrong with the data.
Observations: The Residual Sugar distribution is positively skewed, with a high peak at around 2 and many outliers at the higher ranges.
Observations: The Chlorides distribution is positively skewed. The plots have been modified to exclude extreme outliers.
Observations: The Free Sulphur Dioxide distribution is positively skewed. There is a high peak at 7, but it continues the same positively skewed pattern, with outliers in the high range.
Observations: The Total Sulphur Dioxide distribution is also positively skewed.
Observations: Density is approximately normally distributed.
Observations: pH is also approximately normally distributed.
Observations: The Sulphates distribution is also positively skewed, with a few outliers.
Observations: Alcohol has a somewhat positively skewed distribution, but the skewness is less than in the variables above.
Observations: Most of the wines in the dataset are average quality wines. We aren’t sure whether the data is accurate and complete, because the good quality and poor quality wines are almost like outliers.
The Red Wine dataset originally had 1599 rows and 13 columns; the number of columns became 14 after adding a new column called ‘rating’. ‘quality’ is a categorical variable, and the rest of the variables are numerical variables that reflect the physical and chemical properties of the wine. From what we have observed, most of the wines are of ‘average’ quality, with very few ‘bad’ and ‘good’ ones. The challenge is to build the right predictive model when there isn’t enough data for the good quality and bad quality wines.
The main feature of interest is ‘quality’; the goal is to investigate which factors determine the quality of a wine.
Acidity, whether fixed, volatile or citric, may change the quality of the wine depending on its value, and pH may also have some effect on quality. Residual sugar may have an effect too, because sugar determines the sweetness of the wine and may affect its taste.
Quality was converted from an integer to an ordered factor, and a new column called ‘rating’ was added based on ‘quality’.
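A minimal sketch of that conversion, rebuilt from the counts in the summary output. The cut points are an assumption inferred from those counts (63 = 10 + 53 for scores 3 and 4, 1319 = 681 + 638 for scores 5 and 6, 217 = 199 + 18 for scores 7 and 8); the report does not show the original code.

```r
# Quality scores reproduced from the counts in the summary output.
quality_int <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Convert the integer scores to an ordered factor with six levels.
quality <- factor(quality_int, levels = 3:8, ordered = TRUE)

# Assumed cut points: 3-4 -> bad, 5-6 -> average, 7-8 -> good.
rating <- cut(quality_int,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("bad", "average", "good"),
              ordered_result = TRUE)

table(rating)
```

With these cut points the table matches the `rating` counts in the summary: 63 bad, 1319 average and 217 good.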
Citric acid has no clear visual distribution compared with the rest of the numeric variables; it looks as if the data collection was incomplete.
This is a correlation table of the dataset variables, to see which variables may be correlated with each other.
##
## ---------------------------------------------------------------------------
## fixed.acidity volatile.acidity citric.acid
## -------------------------- --------------- ------------------ -------------
## **fixed.acidity** 1 -0.2561 **0.6717**
##
## **volatile.acidity** -0.2561 1 **-0.5525**
##
## **citric.acid** **0.6717** **-0.5525** 1
##
## **residual.sugar** 0.1148 0.001918 0.1436
##
## **chlorides** 0.09371 0.0613 0.2038
##
## **free.sulfur.dioxide** -0.1538 -0.0105 -0.06098
##
## **total.sulfur.dioxide** -0.1132 0.07647 0.03553
##
## **density** **0.668** 0.02203 **0.3649**
##
## **pH** **-0.683** 0.2349 **-0.5419**
##
## **sulphates** 0.183 -0.261 **0.3128**
##
## **alcohol** -0.06167 -0.2023 0.1099
##
## **quality** 0.1241 **-0.3906** 0.2264
## ---------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## ------------------------------------------------------------------------------
## residual.sugar chlorides free.sulfur.dioxide
## -------------------------- ---------------- ------------ ---------------------
## **fixed.acidity** 0.1148 0.09371 -0.1538
##
## **volatile.acidity** 0.001918 0.0613 -0.0105
##
## **citric.acid** 0.1436 0.2038 -0.06098
##
## **residual.sugar** 1 0.05561 0.187
##
## **chlorides** 0.05561 1 0.005562
##
## **free.sulfur.dioxide** 0.187 0.005562 1
##
## **total.sulfur.dioxide** 0.203 0.0474 **0.6677**
##
## **density** **0.3553** 0.2006 -0.02195
##
## **pH** -0.08565 -0.265 0.07038
##
## **sulphates** 0.005527 **0.3713** 0.05166
##
## **alcohol** 0.04208 -0.2211 -0.06941
##
## **quality** 0.01373 -0.1289 -0.05066
## ------------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------------------------------------
## total.sulfur.dioxide density pH
## -------------------------- ---------------------- ------------- -------------
## **fixed.acidity** -0.1132 **0.668** **-0.683**
##
## **volatile.acidity** 0.07647 0.02203 0.2349
##
## **citric.acid** 0.03553 **0.3649** **-0.5419**
##
## **residual.sugar** 0.203 **0.3553** -0.08565
##
## **chlorides** 0.0474 0.2006 -0.265
##
## **free.sulfur.dioxide** **0.6677** -0.02195 0.07038
##
## **total.sulfur.dioxide** 1 0.07127 -0.06649
##
## **density** 0.07127 1 **-0.3417**
##
## **pH** -0.06649 **-0.3417** 1
##
## **sulphates** 0.04295 0.1485 -0.1966
##
## **alcohol** -0.2057 **-0.4962** 0.2056
##
## **quality** -0.1851 -0.1749 -0.05773
## -----------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -------------------------------------------------------------------
## sulphates alcohol quality
## -------------------------- ------------ ------------- -------------
## **fixed.acidity** 0.183 -0.06167 0.1241
##
## **volatile.acidity** -0.261 -0.2023 **-0.3906**
##
## **citric.acid** **0.3128** 0.1099 0.2264
##
## **residual.sugar** 0.005527 0.04208 0.01373
##
## **chlorides** **0.3713** -0.2211 -0.1289
##
## **free.sulfur.dioxide** 0.05166 -0.06941 -0.05066
##
## **total.sulfur.dioxide** 0.04295 -0.2057 -0.1851
##
## **density** 0.1485 **-0.4962** -0.1749
##
## **pH** -0.1966 0.2056 -0.05773
##
## **sulphates** 1 0.09359 0.2514
##
## **alcohol** 0.09359 1 **0.4762**
##
## **quality** 0.2514 **0.4762** 1
## -------------------------------------------------------------------
Quality is most strongly correlated with Volatile Acidity (-0.39) and Alcohol (0.48).
Density has a strong positive correlation with Fixed Acidity (0.67).
Volatile acidity has a positive correlation with pH (0.23).
Alcohol has a negative correlation with density (-0.50).
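A correlation table of this kind can be computed with `cor()`. The sketch below uses a small simulated data frame whose column names mirror the report; the values and the relationship between them are invented for illustration. Note that an ordered factor such as `quality` has to be converted back to numbers first.

```r
# Toy data frame; column names mirror the report, values are simulated.
set.seed(1)
n <- 200
alcohol <- rnorm(n, mean = 10.4, sd = 1)
toy <- data.frame(
  alcohol = alcohol,
  density = 1.02 - 0.002 * alcohol + rnorm(n, sd = 0.001),
  quality = factor(sample(3:8, n, replace = TRUE), levels = 3:8, ordered = TRUE)
)

# cor() needs numeric input, so convert the ordered factor back to its scores.
toy$quality <- as.numeric(as.character(toy$quality))

# Pearson correlation matrix, rounded for display.
cor_mat <- round(cor(toy, method = "pearson"), 3)
cor_mat
```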
These are box plots of each variable against quality.
The mean and median values of fixed acidity don’t change much as quality increases, so fixed acidity appears to have little effect on quality.
Volatile acidity has a negative correlation with quality: as the volatile acidity level increases, the quality of the wine decreases.
Citric acid has a positive correlation with wine quality: as citric acid increases, wine quality increases.
Residual Sugar appears to have no impact on the quality of the wine. The mean values of residual sugar are almost the same across quality levels.
Chlorides have a negative correlation with quality: as Chlorides decrease, quality increases.
We noticed that low Free Sulphur Dioxide is associated with poor wine, and higher Sulphur Dioxide with average wine.
Good quality wines appear to have lower densities.
Lower pH is associated with better wine, but there are a few outliers here, so we need to see how the acids affect pH.
Of the three acids, fixed acidity and citric acid are negatively correlated with pH, but volatile acidity is positively correlated with it. Since higher acidity should mean lower pH, this is unexpected. Let’s investigate.
Simpson’s paradox is responsible for the reversed trend of Volatile Acidity vs pH.
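To illustrate how Simpson’s paradox can reverse a trend, here is a minimal sketch on simulated data. The two groups and all numbers are invented; the report does not say which grouping variable produced the reversal in the wine data.

```r
# Simulated example only; groups and numbers are invented for illustration.
set.seed(7)
make_group <- function(n, x_center, y_center) {
  x <- rnorm(n, mean = x_center, sd = 0.5)
  # Within each group, y falls as x rises.
  y <- y_center - 0.8 * (x - x_center) + rnorm(n, sd = 0.2)
  data.frame(x = x, y = y)
}
g1 <- cbind(make_group(100, x_center = 2, y_center = 2), group = "A")
g2 <- cbind(make_group(100, x_center = 6, y_center = 6), group = "B")
both <- rbind(g1, g2)

# Correlation within each group versus the pooled correlation.
within <- sapply(split(both, both$group), function(d) cor(d$x, d$y))
pooled <- cor(both$x, both$y)
within
pooled
```

Within each group the correlation is negative, but because group B sits higher on both axes, the pooled correlation comes out positive.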
As Sulphates increase, the quality gets better.
It seems better wines have higher alcohol content, but there are high outliers that affect the result, so alcohol alone may not determine whether a wine is good. A linear model will help to make this clear.
##
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8442 -0.4112 -0.1690 0.5166 2.5888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.12503 0.17471 -0.716 0.474
## alcohol 0.36084 0.01668 21.639 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.2263
## F-statistic: 468.3 on 1 and 1597 DF, p-value: < 2.2e-16
According to the R-squared value, alcohol alone explains only about 22% of the variance in wine quality, so there must be other variables that affect the quality.
Run a correlation test of each variable against wine quality.
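As a cross-check, for a model with a single predictor the R-squared equals the squared Pearson correlation between predictor and response. The sketch below applies this to the two values reported above and then confirms the identity on simulated data (not the wine data).

```r
# Values reported above.
r_alcohol_quality  <- 0.4762  # alcohol vs quality, from the correlation table
r_squared_reported <- 0.2267  # Multiple R-squared from the lm summary

# For one predictor, R-squared is the squared Pearson correlation.
r_squared_from_r <- r_alcohol_quality^2
r_squared_from_r

# The same identity on simulated data.
set.seed(3)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
identity_holds <- isTRUE(all.equal(summary(fit)$r.squared, cor(x, y)^2))
identity_holds
```

The squared correlation agrees with the reported R-squared up to the rounding of the table entry.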
## fixed.acidity volatile.acidity citric.acid
## 0.12405165 -0.39055778 0.22637251
## log10.residual.sugar log10.chlordies free.sulfur.dioxide
## 0.02353331 -0.17613996 -0.05065606
## total.sulfur.dioxide density pH
## -0.18510029 -0.17491923 -0.05773139
## log10.sulphates alcohol
## 0.30864193 0.47616632
These variables have the highest correlations with wine quality:
1. Alcohol
2. Sulphates (log10)
3. Volatile Acidity
4. Citric Acid
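A ranking like this can be produced by correlating each (possibly log-transformed) predictor with quality and sorting by absolute value. The following is a sketch on simulated data; the column names follow the report, but the values and effect sizes are invented.

```r
# Simulated stand-in for two predictors and a numeric quality score.
set.seed(11)
n <- 300
toy <- data.frame(
  alcohol   = rnorm(n, mean = 10.4, sd = 1),
  sulphates = rlnorm(n, meanlog = log(0.62), sdlog = 0.25)
)
toy$quality <- 0.4 * toy$alcohol + 2 * log10(toy$sulphates) + rnorm(n)

# Log-transform the skewed predictor before correlating, as in the output above.
predictors <- data.frame(
  alcohol         = toy$alcohol,
  log10.sulphates = log10(toy$sulphates)
)

# Correlation of each predictor with quality, sorted by absolute strength.
cors <- sapply(predictors, function(v) cor(v, toy$quality))
ranked <- cors[order(abs(cors), decreasing = TRUE)]
ranked
```

Sorting by absolute value matters because a strong negative correlation, such as the one for volatile acidity, is as informative as a positive one.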
Since Alcohol was observed to have a strong effect on quality, we will investigate further and add more variables to see whether they contribute to the overall quality.
The plot shows that the correlation of density with quality was due to alcohol percentage; as shown in the plot, density doesn’t have a clear effect on quality.
Wines with higher alcohol content and higher levels of Sulphates tend to be better wines.
A lower concentration of volatile acid and a higher concentration of alcohol produce better wines.
Low pH and a high Alcohol percentage produce better wines.
There is no correlation between residual sugar and quality.
There are a few high outliers for better wine with high Sulphur Dioxide, but mostly lower Sulphur Dioxide produces better wine.
Now we will investigate the effect of acids on the quality of wines.
Higher Citric Acid and lower Volatile Acid produce better wines.
No clear correlation.
No clear correlation.
Now we will create a linear model with the variables that are most strongly correlated with the quality of the wine.
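library(dplyr)   # sample_frac()
library(memisc)  # mtable()

# Reproducible 60/40 split into training and test sets, keyed on the row id X
set.seed(1221)
training_data <- sample_frac(wine, .6)
test_data <- wine[ !wine$X %in% training_data$X, ]

# as.numeric() on the ordered factor gives level codes 1..6, not the scores 3..8
m1 <- lm(as.numeric(quality) ~ alcohol, data = training_data)
m2 <- update(m1, ~ . + sulphates)
m3 <- update(m2, ~ . + volatile.acidity)
m4 <- update(m3, ~ . + citric.acid)
m5 <- update(m4, ~ . + fixed.acidity)
# m6 branches from m2 (alcohol + sulphates) and adds pH instead of the acids
m6 <- update(m2, ~ . + pH)
mtable(m1,m2,m3,m4,m5,m6)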
##
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = training_data)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = training_data)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity,
## data = training_data)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity +
## citric.acid, data = training_data)
## m5: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity +
## citric.acid + fixed.acidity, data = training_data)
## m6: lm(formula = as.numeric(quality) ~ alcohol + sulphates + pH,
## data = training_data)
##
## ====================================================================================================
## m1 m2 m3 m4 m5 m6
## ----------------------------------------------------------------------------------------------------
## (Intercept) 0.155 -0.273 0.866*** 0.973*** 0.497 1.494**
## (0.220) (0.224) (0.247) (0.254) (0.287) (0.515)
## alcohol 0.333*** 0.320*** 0.286*** 0.284*** 0.296*** 0.339***
## (0.021) (0.021) (0.020) (0.020) (0.020) (0.021)
## sulphates 0.855*** 0.599*** 0.650*** 0.667*** 0.733***
## (0.126) (0.124) (0.127) (0.126) (0.129)
## volatile.acidity -1.153*** -1.279*** -1.352***
## (0.124) (0.143) (0.144)
## citric.acid -0.231 -0.629***
## (0.132) (0.174)
## fixed.acidity 0.058***
## (0.017)
## pH -0.569***
## (0.149)
## ----------------------------------------------------------------------------------------------------
## R-squared 0.209 0.245 0.308 0.310 0.319 0.256
## adj. R-squared 0.208 0.243 0.306 0.307 0.315 0.254
## sigma 0.707 0.691 0.662 0.661 0.657 0.686
## F 252.335 155.125 141.769 107.317 89.264 109.700
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1027.549 -1004.996 -963.139 -961.610 -955.575 -997.782
## Deviance 478.652 456.660 418.487 417.154 411.937 449.841
## AIC 2061.098 2017.992 1936.279 1935.219 1925.150 2005.565
## BIC 2075.695 2037.456 1960.608 1964.415 1959.211 2029.894
## N 959 959 959 959 959 959
## ====================================================================================================
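The adj. R-squared row can be recomputed from the R-squared row, the sample size and the number of predictors in each model. The sketch below copies the rounded values from the table, so it is a consistency check under that rounding rather than a refit of the models.

```r
# R-squared values and sample size copied from the model table above.
r2 <- c(m1 = 0.209, m2 = 0.245, m3 = 0.308, m4 = 0.310, m5 = 0.319, m6 = 0.256)
p  <- c(m1 = 1, m2 = 2, m3 = 3, m4 = 4, m5 = 5, m6 = 3)  # predictors per model
n  <- 959

# Adjusted R-squared penalises each extra predictor.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 3)
```

Rounded to three decimals, the results agree with the adj. R-squared row of the table. The penalty is small here because n is large relative to the number of predictors.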
Linear models were created for the dataset. From what we observed, alcohol explains only about 22% of the variance in wine quality, and most of the factors converged on average quality wines. This may be because the dataset consists mainly of ‘Average’ quality wines, with little data on ‘Good’ and ‘Bad’ quality wines. The linear model equations produced have a low confidence level because of the low R-squared values. It’s difficult to make predictions from an incomplete dataset.
From what we observed, Alcohol and Sulphates have a strong effect in determining wine quality. The linear model also shows how the error varies across the different wine qualities.
The higher the alcohol percentage, the better the wine quality, so alcohol percentage has a strong effect in determining the quality of wines. Even though most of the data points are concentrated in the average quality wines, the very high median in the best quality wines means that almost all of those wines have a high percentage of alcohol. But alcohol is not the only factor responsible for the improvement in quality, as we saw in the linear model.
From what we observed in the plot, high alcohol content and high sulphate concentrations produce better wines. The slight downward slope in the best quality wines may be due to the percentage of alcohol being slightly greater than the concentration of Sulphates.
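library(ggplot2)

# Prediction error of m5 on the held-out test set (predicted minus actual)
df <- data.frame(
  test_data$quality,
  predict(m5, test_data) - as.numeric(test_data$quality)
)
names(df) <- c("Quality", "Error")

# Jitter the points so overlapping errors within each quality level stay visible
ggplot(data=df, aes(x=Quality,y=Error)) +
  geom_jitter(alpha = 0.3) +
  ggtitle("Linear model errors vs. expected quality")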
The plot shows that the errors are much denser in the ‘Average’ quality section than for ‘Good’ and ‘Bad’ quality wines, which reflects the fact that most of our dataset consists of ‘Average’ quality wines. The linear model m5, with an R-squared of 0.319, explains around 32% of the variance in quality, and because of the lack of information the models above are not well suited to predicting both ‘Good’ and ‘Bad’ quality wines.
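The errors in such a plot can also be summarised numerically. The sketch below is an illustration on simulated values: `rmse` is a hypothetical helper that the report does not define, and the predictions are constructed to shrink toward the middle of the scale, which is an assumption about how a low R-squared model tends to behave, not a result from the wine data.

```r
# Hypothetical helper; the report does not define it.
rmse <- function(actual, predicted) sqrt(mean((predicted - actual)^2))

# Simulated stand-in for test-set quality codes (1..6) and model predictions.
set.seed(5)
actual <- sample(1:6, 200, replace = TRUE,
                 prob = c(0.01, 0.03, 0.43, 0.40, 0.12, 0.01))
# By construction, predictions are pulled toward the centre of the scale.
predicted <- 3.6 + 0.3 * (actual - 3.6) + rnorm(200, sd = 0.5)

# Overall root mean squared error.
overall <- rmse(actual, predicted)

# Mean signed error per quality level (predicted minus actual).
by_level <- tapply(predicted - actual, actual, mean)
overall
by_level
```

On the real test set, `actual` would be `as.numeric(test_data$quality)` and `predicted` would be `predict(m5, test_data)`.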
In conclusion, in this analysis we first created plots of the different variables against quality to understand the relationships between them, and then investigated the correlations between them and wine quality. We found that the factors that most affect the quality of the wine were Alcohol percentage, Sulphate and Acid concentrations. We also found an interesting phenomenon where volatile acidity had an unexpected positive correlation with pH, and we found that this was due to Simpson’s paradox. We then finalized the analysis by creating multivariate plots to find combinations of variables that affect the overall wine quality.
The main struggle in this analysis was to get a higher confidence level when predicting the factors that affect the different qualities of wine, especially ‘Good’ and ‘Bad’, since the data was concentrated around the ‘Average’ quality. The training set has incomplete data, which makes it difficult to build an accurate model. From what we observed, some wines contain citric acid and others don’t. We realized that citric acid is added to some wines in order to increase the acidity, which is why its distribution looked almost rectangular.
For future analysis, I hope to have a more complete dataset to help in predicting the higher range values and building more accurate models.